Text editors & Memory Management

TAD

Introduction

This article was inspired by a recent conversation with a good friend (yow Beerhunter!) about text editors, memory management and various schemes for dealing with very, very large files using memory- and disk-based buffers. It will look at the limitations of some text editors, the problems with removing some of those limitations, and offer a (hopefully) useful and easy-to-understand solution.

The big picture

For many people (perhaps even experienced programmers) creating a text editor appears to be a very trivial task. On the face of it, it is. You only need to move around a file/buffer and insert or delete characters. I've written some tiny text-file viewers in less than 150 bytes (low-level access, expanding TABs and dealing with the CURSOR, HOME, END, PGUP and PGDN keys). So it would seem that coding a text editor is a simple task, but it's the insert and delete operations which can lead to all sorts of complexities and limitations, as we will soon see.

Okay, let's look at what the problem is.

A purely memory-based text editor is simple enough to do and this is what many people code. You only need a fast block-move routine and a quick machine and you have solved the insert/delete problem. Because all of the file is in memory at the same time there should be very little delay between typing a character and seeing the result on screen. Of course if you are editing a multi-megabyte file and changing the very first line, then you might notice a small delay after each keypress because the rest of the file would need to move up, or down, in memory as you delete, or insert, characters.
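To make that cost concrete, here is a minimal sketch in C of the insert routine a flat-buffer editor ends up with (the names and the missing capacity check are my assumptions, not anyone's actual code). Note the memmove that has to touch every byte after the cursor:

 #include <string.h>

 /* Flat-buffer insert (a sketch): every byte after 'pos' is shifted
    up by one, so inserting at the start of a multi-megabyte buffer
    moves almost the whole file on each keypress.  Assumes the buffer
    has spare capacity for the extra byte. */
 void buffer_insert_char(char *buf, size_t *len, size_t pos, char c)
 {
     memmove(buf + pos + 1, buf + pos, *len - pos); /* the costly move */
     buf[pos] = c;
     (*len)++;
 }

Delete is the same memmove in the opposite direction; either way the cost grows with everything that sits after the cursor.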

Fragments

Okay, most of you will have already thought that using some non-contiguous memory scheme would solve the memory-move delay problem. For example, breaking the entire text file into a series of strings is one idea. Using nodes/linked-lists is a second. Thinking of the text document as memory and employing a memory-allocation scheme seems like a good way to go. But, as we will see later on, the insert/delete operations will cause a few headaches.

Breaking the text document into a series of lines is the most obvious method for fragmenting a large text file. When you edit the first line, you only need to memory-move that particular line (string).

Of course any memory allocation, re-allocation and free operations must (at some point in time) deal with the normal house-keeping chores (like de-frag, concatenating neighbouring blocks of released memory and relocating blocks). Sooner or later that first line will outgrow the memory block you've allocated for it.

Another point worth dealing with is what size these fragments should be. Make them too small and you will have to realloc/defrag sooner. Make them too big and you will be wasting memory on lines with only a few characters in them. In the worst case you could be allocating an 8 KB chunk for each and every blank line.

Line length limitation

Breaking the text document into a series of lines (strings) seems like a very popular method. But let's consider a text document with 1000000+ characters on the same line. What happens when you attempt to load it into memory? Of course you would split it into a number of lines, each line having some pre-determined maximum length (for example, 8192).

This is a very sensible solution, but it can introduce some complexity later on when editing. For example, what happens when you delete the EOL (CR-LF or LF) characters at the end of an 8192-character line? Doing so should join the next line onto the end of the current line, but that breaks the maximum line-length limit.

But, suppose you wanted to allow an unlimited length line?

This question generated a lot of ideas in the conversation. What would be the best method to use to support an unlimited line length? The issues of speed and simplicity were the main driving forces behind the scribbles on a few sheets of paper.

The Objectives

There were a few goals in mind while discussing which solution would be best.

1) The method must be simple and fast.

2) Unlimited document size (and unlimited line length) must be supported.

3) The solution must support some virtual-memory-like operations, so that only part of the text file being edited needs to be in memory; the rest of it can exist on disk in a temporary swap file.

4) Memory-move (insert & delete) operations should be kept to a minimum.

5) De-fragmentation should be avoided (if possible).

6) Typing single characters (insert/delete) should be as fast as possible.

Fixed-sized blocks

The first step was to find a way to split the text document into easy-to-manage parts. The method we came up with was to use blocks of a fixed size (such as 4096, 16384 or 32768 bytes).

Okay, each chunk would have a small data structure together with the fixed-size buffer area. The structure would contain the normal node elements (next-node pointer, previous-node pointer) together with some flags.
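In C, such a chunk might look something like this (just a sketch; the field names and the 4096-byte size are my assumptions for illustration, not a fixed design):

 #define CHUNK_SIZE 4096              /* fixed buffer size per chunk */

 struct chunk {
     struct chunk *next;              /* next node pointer           */
     struct chunk *prev;              /* previous node pointer       */
     unsigned int  flags;             /* e.g. in-memory / dirty bits */
     unsigned char data[CHUNK_SIZE];  /* the fixed-size buffer area  */
 };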

Let's use some 4 byte chunks to see how they would work.

  chunk 0     chunk 1      chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐    ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│G│H│ ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘ │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘        └──┘         └───┘
       next link    next link    next link

Each chunk could either be in memory, or stored on disk in a temporary swap-file. The 'next link' would probably be best done using a virtual-pointer data structure something like this:

 virtlink        STRUC 
    Position             dd      ?    ; memory address (in memory) 
                                      ; or, file position (on disk) 
    Flags                dd      ? 
         VM_IN_MEMORY_F  equ     00000001h 
 virtlink        ENDS 

Whenever we use a virt-link to follow a link to the next chunk, we first check to see if that chunk is already in memory. If not, we perform the swap-file <---> memory operation to load it into memory and, of course, mark the chunk as VM_IN_MEMORY_F.
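A sketch of that check in C, building on the hypothetical struct chunk above (swap_in is an assumed helper that reads one chunk back from the temporary swap file, and stuffing a pointer into a dword is fine on the 32-bit machines we're talking about):

 #include <stdio.h>
 #include <stdlib.h>

 #define VM_IN_MEMORY_F 0x00000001

 struct virtlink {
     unsigned long position;   /* memory address (in memory)  */
                               /* or, file position (on disk) */
     unsigned long flags;
 };

 /* Assumed helper: read one chunk's data back from the swap file. */
 static struct chunk *swap_in(FILE *swap, long file_pos)
 {
     struct chunk *c = calloc(1, sizeof *c);
     fseek(swap, file_pos, SEEK_SET);
     fread(c->data, 1, CHUNK_SIZE, swap);
     return c;
 }

 /* Follow a virt-link, swapping the chunk into memory if needed. */
 struct chunk *follow_link(struct virtlink *link, FILE *swap)
 {
     if (!(link->flags & VM_IN_MEMORY_F)) {
         link->position = (unsigned long)swap_in(swap, (long)link->position);
         link->flags   |= VM_IN_MEMORY_F;
     }
     return (struct chunk *)link->position;
 }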

Inserting memory

Suppose we want to insert a new chunk 'WXYZ' before the 'I' in chunk 2. This would be a standard node insertion operation.

Before:

  chunk 0     chunk 1      chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐    ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│G│H│ ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘ │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘        └──┘         └───┘
       next link    next link    next link

After:

  chunk 0     chunk 1     ** NEW **      chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐    ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│G│H│ ┌─>│W│X│Y│Z│  ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘ │  └─┴─┴─┴┬┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘        └──┘         └───┘         └───┘
                    next link    next link

This would give a blindingly fast insert (even if we were on the first line of a 50 meg text document).
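In C the splice is the usual doubly-linked-list insertion (again a sketch on the hypothetical struct chunk from earlier). Note that no document text is copied, so the cost is the same for a 50 meg file as for a 50 byte one:

 #include <stdlib.h>
 #include <string.h>

 /* Splice a brand-new chunk in front of 'before' (chunk 2 in the
    example above).  'len' must be at most CHUNK_SIZE. */
 struct chunk *insert_chunk_before(struct chunk *before,
                                   const unsigned char *text, int len)
 {
     struct chunk *c = calloc(1, sizeof *c);
     memcpy(c->data, text, len);          /* 'WXYZ' in the example */
     c->prev = before->prev;
     c->next = before;
     if (before->prev)
         before->prev->next = c;
     before->prev = c;
     return c;
 }

In the real editor the next/prev fields would be virt-links rather than raw pointers, but the splice itself is identical.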

Inserting memory (2)

Sure, that was too easy. In the above example we used a 4-byte insert (exactly the same size as our fixed-size chunks).

Okay, it's time to take the bull by the horns and return to the thorny problem of insert and delete. Splitting the document up into fixed-size chunks allowed easy virtual-memory management, but it still leaves us with the problem of moving vast blocks of memory around.

Instead of inserting a 4-byte section, let's perform a 1-byte insert between 'F' and 'G' of chunk 1.

  chunk 0     chunk 1      chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐    ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│G│H│ ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘ │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘     ^  └──┘         └───┘
                 insert here + memory move ---->

What happens when we're trying to insert a non-chunk sized block of data?

At first this problem looks like it will take us all the way back to square one: having to move a huge block of memory about for each insert/delete. It seems like we must memory-move all the letters GHIJK...OP to insert a single byte. To make matters worse, the memory move would now also have to deal with chunk boundaries.

Finding the key(board) to the solution.

Thinking of a single chunk as a block of memory that we want to scroll will give you a big hint as to where we're going with the chunk problem. Well, perhaps 'scroll' is the wrong word here, but thinking in terms of a keyboard buffer will help.

We extend the chunk structure to include a 'Head' and 'Tail' index position. These new variables allow us to break each chunk into small fragments when and where we need to.

Again, let's insert 'X' between 'F' and 'G' of chunk 1.

  chunk 0     chunk 1       chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│X│-│  ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴┬┴─┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘      │     │         └───┘
                  │     │
            split │     │
                  │  ┌─┬┴┬─┬─┐ newly created chunk
                  └─>│G│H│-│-│ filled with 'GH' from
                     └─┴─┴─┴─┘ old chunk 1.

We split the chunk into two, copy the trailing bytes into the newly created chunk, then add our 'X' byte to the old chunk.

Of course this creates a small memory overhead (another chunk structure, plus the Head and Tail variables in each chunk), but it does solve the problem of having to shift lots of memory about. In the worst case we need to copy ChunkSIZE-1 bytes, even if it is the very first line of a 50 meg text document. So from a memory-move point of view, we're only ever dealing with (for example) 8 KB at a time.
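Here is how the split-and-insert might look in C. The chunk structure now carries the Head and Tail indices (my field names and layout; head is the first used byte, tail is one past the last):

 #include <stdlib.h>
 #include <string.h>

 #define CHUNK_SIZE 4096

 struct chunk {                  /* as before, plus Head and Tail */
     struct chunk *next, *prev;
     int head, tail;             /* first used / one-past-last used byte */
     unsigned char data[CHUNK_SIZE];
 };

 /* Split 'c' at absolute offset 'split' inside the data area, move
    the trailing bytes into a fresh chunk, then drop the new byte 'ch'
    into the old one.  At most CHUNK_SIZE-1 bytes are copied, no
    matter how big the document is. */
 void chunk_split_insert(struct chunk *c, int split, unsigned char ch)
 {
     struct chunk *t = calloc(1, sizeof *t);
     int moved = c->tail - split;             /* bytes after the cursor */

     memcpy(t->data, c->data + split, moved); /* 'GH' in the example */
     t->head = 0;
     t->tail = moved;

     t->next = c->next;                       /* splice in the new chunk */
     t->prev = c;
     if (c->next) c->next->prev = t;
     c->next = t;

     c->data[split] = ch;                     /* add our 'X' */
     c->tail = split + 1;
 }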

And it gets better.

After the split we can simply add characters/bytes to the end of the broken chunk until we reach the ChunkSIZE limit. And that means absolutely NO block copying!! Let's type a 'Y' into the document after the 'X' character.

  chunk 0     chunk 1       chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│X│Y│  ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘        │   │         └───┘
                  ┌─┘   │
      add 'Y' by  │     │
   updating the   │  ┌─┬┴┬─┬─┐
   'Tail' index   └─>│G│H│-│-│
    of chunk 1       └─┴─┴─┴─┘

After this point we create a new, empty, chunk and insert the next character/byte 'Z' at index position 0 (see below).

  chunk 0     chunk 1       chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │A│B│C│D│ ┌>│E│F│X│Y│  ┌─>│I│J│K│L│  ┌─>│M│N│O│P│
 └─┴─┴─┴┬┘ │ └─┴─┴─┴┬┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        └──┘        │   │         └───┘
                    │   │
  ┌─────────────────┘   │
  │  ┌─┬─┬─┬─┐       ┌─┬┴┬─┬─┐
  └─>│Z│-│-│-│  ┌───>│G│H│-│-│
     └┬┴─┴─┴─┘  │    └─┴─┴─┴─┘
      └─────────┘

We can continue to 'insert' new characters after the 'Z' until the ChunkSIZE limit is again reached, at which point the process repeats with another new chunk.
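So, for plain typing, a sketch of the fast path (same assumed structures as before): each keypress is one store plus an index bump, and a fresh chunk is only allocated every ChunkSIZE characters:

 #include <stdlib.h>

 /* Append a typed character at the split point.  Only when the
    current chunk is full do we allocate and splice in a new one. */
 struct chunk *insert_char_at_tail(struct chunk *c, unsigned char ch)
 {
     if (c->tail == CHUNK_SIZE) {          /* chunk full: the 'Z' case   */
         struct chunk *n = calloc(1, sizeof *n);
         n->next = c->next;
         n->prev = c;
         if (c->next) c->next->prev = n;
         c->next = n;
         c = n;                            /* carry on in the new chunk  */
     }
     c->data[c->tail++] = ch;              /* the 'Y' case: one store    */
     return c;                             /* chunk now holding the cursor */
 }

Which is exactly what objective 6 asked for.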

Search and Replace

Like most data operations, text editing is done in small, local areas. You edit a line at a time, scroll up and down a few pages, then edit another line. Even with multiple windows open you rarely change the entire document. That is, except when you are performing global search-and-replace operations. In the worst possible case you could replace every 'a' with 'bb' in your 10 meg file. This would involve a great deal of chunk splitting and data copying.

One way to help speed things up would be to use a source-and-destination chunk scheme, where you read from one chunk and write the replacement characters into a second one. Once a destination chunk is filled you would need a scheme to mark it as a new chunk and link it back into the original node chain (probably keeping the old source chunk and updating its head index to point at the remaining characters).

For example, say we want to replace 'A' with 'BC':

  chunk 0     chunk 1       chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │m│n│o│p│   │-│-│A│A│  ┌─>│A│A│A│A│  ┌─>│w│x│y│z│
 └─┴─┴─┴┬┘   └─┴─┴─┴┬┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        │         ^ └───┘         └───┘
        │         │
        │         └────┐  result after replacing 2 'A's
        │    ┌─┬─┬─┬─┐ │  from chunk 1 (the --) with
        └───>│B│C│B│C│ │  2 'BC' (the new chunk)
             └─┴─┴─┴┬┘ │
              new   └──┘

As you can see above, we have filled up our 'new' chunk but there are still characters remaining in the old source chunk 1. If we continue and replace 2 more 'A' characters, we get:

  chunk 0     chunk 1       chunk 2       chunk 3
 ┌─┬─┬─┬─┐   ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐     ┌─┬─┬─┬─┐
 │m│n│o│p│   │-│-│-│-│  ┌─>│A│A│A│A│  ┌─>│w│x│y│z│
 └─┴─┴─┴┬┘   └─┴─┴─┴─┘  │  └─┴─┴─┴┬┘  │  └─┴─┴─┴─┘
        │               │         └───┘
        │               │
        │               └──────────┐
        │    ┌─┬─┬─┬─┐    ┌─┬─┬─┬─┐│
        └───>│B│C│B│C│ ┌─>│B│C│B│C││
             └─┴─┴─┴┬┘ │  └─┴─┴─┴┬┘│
              new   └──┘   new 2 └─┘

We could then place the old chunk 1 back into the free pool of chunks, ready to be allocated later.

As you can hopefully see, this method would be faster when performing an 'insert-replace' (where the replacement string is longer than the existing string), because you don't need to perform a chunk-scroll (i.e. a character insert) to make space for the extra 'C' character.

When the replacement string is the same length as the existing one, we can simply overwrite bytes in each chunk.

If the replacement string is smaller, then you could still use the source and destination method, but you will use fewer and fewer chunks the more you replace.
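As a rough C sketch of the source/destination idea (reusing the head/tail chunk structure from earlier; it handles one match at a time, and the book-keeping that links the destination chain back into the document and recycles the emptied source chunks is left out):

 #include <stdlib.h>

 /* Consume one matched string from the source chunk (by advancing its
    head index - no bytes are shifted) and write the replacement into
    the destination chain, starting a new destination chunk whenever
    the current one fills up. */
 struct chunk *emit_replacement(struct chunk *dst,
                                const unsigned char *rep, int rep_len,
                                struct chunk *src, int match_len)
 {
     src->head += match_len;          /* eat the matched 'A'         */
     while (rep_len-- > 0) {          /* write 'B','C',...           */
         if (dst->tail == CHUNK_SIZE) {
             struct chunk *n = calloc(1, sizeof *n);  /* 'new 2'     */
             n->prev = dst;
             dst->next = n;
             dst = n;
         }
         dst->data[dst->tail++] = *rep++;
     }
     return dst;                      /* current destination chunk   */
 }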

Closing Words (pun intended)

Oh well, I hope you've found this an interesting problem. I bet most of you had never really thought about it in any great detail before. I must admit, until a few days ago I wouldn't have thought it could be so complex.

But is all this complexity really needed on today's PCs when most people have got 128+ MB of memory and a very fast CPU?

Well, maybe not, but it's fun to see how complex something as simple as a text editor can get, and to look for a solution.

Happy typing,

TAD